Graphical perception

When we (the designers) visualize data, we encode the quantitative information in shapes, color, position, etc. The viewers then have to decode that information. Cleveland and McGill studied what people are able to decode most accurately and ranked them in the following list.

1. Position along a common scale e.g. scatter plot  
2. Position on identical but nonaligned scales e.g. multiple scatter plots  
3. Length e.g. bar chart  
4. Angle & Slope (tie) e.g. pie chart  
5. Area e.g. bubbles
6. Volume, density, and color saturation (tie) e.g. heatmap  
7. Color hue e.g. newsmap (quantitative data)

See ppt#7 Perception.pdf

Building blocks of a graph include:
1. data
2. aesthetic mapping aes()
- position(x-axis,y-axis)
- color
- fill
- shape (of points)
- linetype
- size
3. geometric object
- histogram: geom_histogram
- points: geom_point for scatter plots, dot plots, etc
- lines: geom_line for time series, trend lines, etc
- boxplot: geom_boxplot
- text: geom_text
4. statistical transformations
5. scales: one scale per mapping
6. coordinate system
7. position adjustments
8. faceting
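The building blocks above correspond directly to components of a ggplot2 call. A minimal sketch using the built-in mtcars data (the variable choices are just for illustration):

```r
library(ggplot2)

# data + aesthetic mapping: position (x, y) and color
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +                          # geometric object
  scale_color_brewer(palette = "Dark2") + # one scale per mapping
  facet_wrap(~ am)                        # faceting
```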


Continuous Variables

Histogram

  • One option for displaying continuous data.
    • shows the data’s empirical distribution within a set of intervals
    • default is bins = 30
  • Display histograms with equal scales and binwidths to compare features of variables more easily.

  • What density estimates show depends greatly on the bandwidth used

  • Histogram or density curve:
    • modes/peaks : how many modes or peaks (multimodality)

\[\text{Density}=\frac{\text{RelFreq}}{\text{Binwidth}}\]

  • too much up and down: increase the binwidth (i.e., decrease the number of bins)

  • center

  • boundary
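As a sketch of how much the binwidth matters, two histograms of the same variable (the built-in faithful eruption times) with different binwidths:

```r
library(ggplot2)

# Narrow bins: the two modes of the eruption times are visible
ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(binwidth = 0.1)

# Wide bins: the bimodality is hidden
ggplot(faithful, aes(x = eruptions)) +
  geom_histogram(binwidth = 1)
```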

Boxplot

The boxplot is a compact distributional summary, displaying less detail than a histogram or kernel density, but also taking up less space. Boxplots use robust summary statistics that are always located at actual data points, are quickly computable (originally by hand), and have no tuning parameters. They are particularly useful for comparing distributions across groups. - Hadley Wickham

NOT FOR CATEGORICAL VARIABLES (not for discrete data either!)

  • right skewed; outliers

Boxplot is best to compare distributions by subgroups

  • Boxplots by group: must share the same scale, and can be drawn with widths proportional to the size of each group

  • Boxplots of different variables: have different scales, and each case appears in each boxplot (no need to vary widths)

  • disadvantage: cannot reveal whether a distribution is multimodal

    • can draw an accompanying barchart of the counts in each group

Outliers: over 1.5 times the box length away from the box

Extreme outliers: over 3 times the box length away from the box

  • groups can be reordered by a statistic (median, mean, sd) using reorder()
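A sketch of grouped boxplots, reordered by the group medians and with widths reflecting group sizes (using ggplot2's built-in mpg data):

```r
library(ggplot2)

# Boxplots by group: same scale, groups reordered by median highway mpg,
# box widths proportional to the size of each group
ggplot(mpg, aes(x = reorder(class, hwy, median), y = hwy)) +
  geom_boxplot(varwidth = TRUE)
```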

Normality

  • Q-Q plots: to look at normality
    • x-axis: theoretical quantiles: standard normal distribution with mean 0 and standard deviation 1 \(N(0,1)\)
    • y-axis: sample distribution


  • Shapiro-Wilk test: shapiro.test to test normality.
    • Null hypothesis: normal
    • p-value < 0.05, reject null: not normal
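A minimal sketch of both normality checks, on simulated data:

```r
# Q-Q plot: theoretical N(0,1) quantiles (x) vs sample quantiles (y)
set.seed(1)
x <- rnorm(100)    # simulated sample; replace with your own data
qqnorm(x)          # points near a straight line suggest normality
qqline(x)          # reference line

# Shapiro-Wilk test; H0: the data are normal
shapiro.test(x)    # p < 0.05 would reject normality
```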

Multiple Continuous Variables

Modelling and testing for relationships between variables
1. Correlation: correlation coefficients
2. Regression
3. Smoothing
- loess (locally weighted regression)
- spline functions
4. Bivariate density estimation
- kde2d; kde; bkde2D
5. Outliers

Bivariate Continuous Data

Scatterplot

  • revealing associations between variables, not just linear associations, but any kind of association
  • identifying outliers and for spotting distributional features (still need to look at univariate distributions)

Features that might be visible in scatterplots:
1. Causal relationships (linear and nonlinear): correlation \(\ne\) causation
2. Association (without being directly causally related)
3. Outliers or groups of outliers
4. Clusters: to assess the possibility of clustering, consider a density estimate.
5. Gaps
6. Barriers (boundaries)
7. Conditional relationships (different relationships for different intervals of x)
- e.g., a plot of income against age is likely to differ before and after retirement age

  • add lines or smooths: if you think there is a linear causal relationship

  • Comparing groups within scatterplots
    • facet_wrap(~var)
  • Scatterplot matrices for looking at many pairs of variables
    • ggpairs; splom; spm
  • Overlapping data: consider alpha blending or jittering
    • alpha=0.5
    • position="jitter"
  • Contour lines: give a sense of the density of the data
    • geom_density_2d()
  • Scatterplot matrices are valuable for identifying bivariate patterns even with quite a few variables
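A sketch combining several of these ideas on ggplot2's built-in mpg data:

```r
library(ggplot2)

# Overlapping data: alpha blending + jittering; contours for density;
# facets to compare groups within the scatterplot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.5, position = "jitter") +
  geom_density_2d() +
  facet_wrap(~ drv)
```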

Multivariate Continuous Data

Modelling and testing for multivariate continuous data
1. Outliers
- interactive graphics are the best approach
2. Regular gaps in the distribution of a variable
3. Clusters of cases
4. Separated groups: whether this means anything depends on the values those cases take on other variables, especially categorical ones
- linear models can be useful in assessing these features

Parallel Coordinate Plot

  • popular for multivariate continuous data

  • ggparcoord

  • parallel coordinate plots can include axes for categorical variables as well

When to use:
- to detect a general trend that the data follow, as well as the specific cases that are outliers
- not an ideal graph when only categorical variables are involved
- to identify trends in specific clusters
- highlight each cluster in a different color using the groupColumn argument
- graphing time series data: information stored at regular time intervals

Scales
- std: the default; subtract the mean and divide by the standard deviation
- robust: subtract the median and divide by the median absolute deviation
- uniminmax: scale all values so that the minimum is 0 and the maximum is 1
\[y_{ij}=\frac{x_{ij}-\min_i x_{ij}}{\max_i x_{ij}-\min_i x_{ij}}\]
- globalminmax: no scaling; the original values are used
- center: centers each variable according to the value given in scaleSummary
- centerObs: centers each variable according to the value of the observation given in centerObsID

  • Splines: useful when there are a lot of repeating values
    • The case lines become more and more curved as the spline factor is increased, which removes noise and makes trends easier to observe.
    • splineFactor = 10
  • Features:
    1. identify many different multivariate features in parallel coordinate plots
    2. identify outliers, correlations, clusters
    3. quick overviews of univariate distributions for several variables at once
      • skew, outliers, gaps, concentrations
    4. bivariate associations between adjacent variables
  • Parallel coordinate plots are useful for studying groups of cases and are most effective when they are interactive
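A sketch with GGally's ggparcoord(), using the built-in iris data; the scale and splineFactor settings are illustrative:

```r
library(GGally)  # provides ggparcoord()

# Parallel coordinate plot of the four iris measurements, one line per
# case, coloured by group (Species), with robust scaling and splines
ggparcoord(iris, columns = 1:4, groupColumn = "Species",
           scale = "robust", splineFactor = 10)
```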

Categorical Data

Single categorical variables, nominal, ordinal, or discrete.

Graphics: Barchart and Piecharts

Features:      
  1. Unexpected patterns of results    
    * many more of some categories than others    
    * some may be missing completely    
  2. Uneven distributions  
    * bias, dominated by some major trials…
  3. Extra categories     
    * M and F, but also m and f, male and female    
  4. Unbalanced experiments      
    * missing or unusable    
  5. Large numbers of categories    
  6. many others: refusals, errors, missings...     

Modeling and testing for categorical variables
1. Testing by simulation
- \(\chi^2\) test
2. Evenness of distribution
- \(\chi^2\) test: null hypothesis of equally likely categories, e.g., if a random number generator is to be checked directly
3. Fitting a discrete distribution
- \(\chi^2\) test: inspect any lack of fit


Bar Chart

  • If we have a table of values and want to plot them as explicit bar heights, specify stat = "identity" in geom_bar(). If each row is one observation and the rows should be counted into bars, this is not needed (the default is stat = "count").
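A sketch of both situations (the small counts table is hypothetical):

```r
library(ggplot2)

# Pre-tabulated values: plot them as explicit bar heights
counts <- data.frame(group = c("A", "B", "C"), n = c(10, 25, 7))
ggplot(counts, aes(x = group, y = n)) +
  geom_bar(stat = "identity")   # geom_col() is equivalent

# One row per observation: the default stat = "count" tallies the bars
ggplot(mpg, aes(x = class)) +
  geom_bar()
```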

Cleveland Dot Plot

  • A great alternative to a simple bar chart

  • R built-in base function dotchart(), or geom_point in ggplot
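A sketch of both versions, on a small hypothetical table of counts:

```r
library(ggplot2)

counts <- data.frame(group = c("A", "B", "C"), n = c(10, 25, 7))

# ggplot version: one point per category, sorted by value
ggplot(counts, aes(x = n, y = reorder(group, n))) +
  geom_point()

# Base R version
dotchart(counts$n, labels = counts$group)
```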

Multivariate Categorical Data

Frequency:
- bar charts
- cleveland dot plots
Proportion/Association:
- Mosaic plots
- Fluctuation diagrams

Modelling and testing for multivariate categorical data
1. Contingency tables
- The standard for checking the association of two categorical variables is the \(\chi^2\) test.
2. Associations between categorical variables
3. Binary dependent variables

Mosaic Plot

  • The graphics area is divided up into rectangles proportional in size to the counts of the combinations they represent.
  • Lengths are easier to judge and compare than areas, so it is best to use displays where each rectangle has the same width or the same height.

  • horizontal split into rectangles, then a vertical split, then a horizontal split, and so on

Order of Splits

Favorite ~ Age + Music: Age is split first, Music next, Favorite last.
- independent variables are split first, then the dependent variable

  • the dependent variable is split last, and horizontally (the default)
  • fill is set to the dependent variable
  • the other variables are split vertically
  • the most important level of the dependent variable is closest to the x-axis and darkest (or the most noticeable shade)
Direction of Splits

Default:
- Age: h
- Music: v
- Favorite: h

  • Changed to ("v","v","h"): Age v, Music v, Favorite h
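The split order and directions can be sketched with vcd's mosaic() (one common choice of package; the built-in Titanic table stands in for the Favorite ~ Age + Music example):

```r
library(vcd)  # assumed; provides mosaic()

# Independent variable (Class) split first; dependent variable
# (Survived) split last and mapped to fill via highlighting
mosaic(~ Class + Survived, data = Titanic,
       direction = c("v", "h"),      # one split direction per variable
       highlighting = "Survived")
```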

  • No gap = most efficient

  • Taller thinner rectangles are better

  • The construction of mosaicplots is hierarchical and the order of variables has a big impact on the display.
  • With a large number of combinations, no mosaicplot is likely to work.

  • Mosaic plots can be very helpful for displaying raw data, and they can also be used to support modelling.

  • Doubledecker plot: good for comparing rates for a binary dependent variable across all possible groupings

  • Choice of ordering of a variable categories
    • Ordinal variables must be kept in the correct sequence, either increasing or decreasing.
    • Nominal variables: if there is no sensible default ordering, order by frequency
  • Aspect Ratio determines what can be seen in a plot and various sizes…
  • color
  • label

  • example: no relationship between the variables

  • example: a deterministic relationship

Heatmap

  • geom_bin2d()
  • geom_tile() to graph all cells in the dataframe and color them by their value
  • geom_hex(): hexagonal bins
  • like a combination of scatterplots and histograms: allows comparing different parameters while also seeing their relative distributions

  • can show frequency counts (2D histogram) or value of a third variable
  • can be used for continuous or categorical data (both for axes and fill color)
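Sketches of both uses, with built-in data:

```r
library(ggplot2)

# 2D histogram: frequency counts per rectangular bin
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d(bins = 30)

# geom_tile(): one cell per (x, y) pair, coloured by a third variable
ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_tile()
```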

Fluctuation Diagrams

  • good for representing large contingency tables or transition matrices, where there is no reason to differentiate between the row variable and the column variable
  • good for detecting categorical outliers
  • also helpful for identifying clusters in the data
  • fluctile()
    • the boxes reflect the count for the combination of the categories
  • good to show symmetry or asymmetry of data

Likert

strongly agree, agree, don’t know (neutral), disagree, strongly disagree

Time Series

Special Features of time series
1. Data definitions
- Dependence on time
- Have a given order, and individual values are not independent of one another
2. Length of time series
- Time series can be short (annual sales) or long (every minute)
- Sometimes the short-term details of long series obscure long-term trends; sometimes they are of particular interest
- Series on different time scales can be informative
- For longer periods, consider taking the final value, the value at the middle of the period, the average, or a weighted average; or add a smooth; or many other options
3. Regular and irregular time series
- regular time series: equally spaced time points. e.g., hourly data, daily data, yearly data….
- irregular time series: e.g., patient’s temperature or blood pressure; political opinion polls are more frequent near elections than at other times
4. Time series of different kinds of variables
- Most are assumed to be of continuous variables, but still have nominal, discrete…
5. Outliers
- not necessarily extreme values for the whole series; maybe unusual only in relation to the pattern around them
- scales for time series are unusual…
- “The usual principle applies that it is best to draw several displays, zooming in to inspect details that may otherwise be hidden and zooming out to see the overall context”
6. Forecasting
- Two main reasons for studying time series:
- to try to understand the patterns of the past
- to try to forecast the future
7. Seeing patterns

Scale

df %>%
  group_by(symbol) %>%
  mutate(rescaled_close = 100 * close / close[1])

Rescale the stock price for each symbol (company) to start at 100 by grouping by symbol and dividing each closing price by the first one.

Secular Trend (long-term trend)

Overall long-term trend.
geom_smooth() function

  • Loess smoother
    • A loess smoother does not assume any model
    • loess: locally estimated scatterplot smoothing
      • non-parametric regression
      • does not specify a global function of any form to fit to the data; fits only segments of the data
      • a smooth curve through a set of data points obtained with this technique is called a loess curve
      • Advantages: does not require specifying a function to fit a model to all of the data; ideal for modelling complex processes for which no theoretical model exists
      • Disadvantages: less efficient use of data than other least squares methods; requires fairly large, densely sampled data sets to produce good models; does not produce a regression function that is easily represented by a mathematical formula
        
  • Different smoothing parameters, e.g. span = 0.5

  • Default smoothing parameter: span = 0.75

  • span = 0.1

  • The smaller the smoothing parameter, the more the curve fluctuates (follows local detail)
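A sketch comparing spans on ggplot2's built-in economics series:

```r
library(ggplot2)

p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

# Small span: the loess curve follows local detail (more wiggly)
p + geom_smooth(method = "loess", span = 0.1)

# Default span (0.75): a much smoother long-term trend
p + geom_smooth(method = "loess", span = 0.75)
```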

Frequency of Data

  • geom_point() in addition to geom_line()
    • using geom_point() with geom_line() is one way to detect missing values
  • leave gaps at missing values rather than connecting the line across them

Outliers

Higher dimensional outliers may be not outliers in lower dimensions.

univariate outliers

In a boxplot, points more than 1.5 IQR (the interquartile range) outside the hinges (the quartiles) are flagged as outliers. \[\begin{aligned} \text{outliers} &< Q_1-1.5\,\text{IQR}\\ \text{outliers} &> Q_3+1.5\,\text{IQR} \end{aligned}\]
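The fences can be computed directly; a sketch on simulated data with one planted extreme value:

```r
set.seed(1)
x <- c(rnorm(100), 10)               # simulated data plus one extreme value
q <- quantile(x, c(0.25, 0.75))      # hinges Q1 and Q3
iqr <- q[2] - q[1]                   # interquartile range
outliers <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
outliers                             # includes the planted value 10
```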

Outliers may change if they are grouped by another variable.

multivariate outliers

Scatterplots and parallel coordinate plots are useful for visualizing multivariate outliers. You could regard points as outliers that are far from the mass of the data, or you could regard points as outliers that do not fit the smooth model well.

  • by fitting a smooth model, outliers are the points that do not fit the smooth model:

  • adding density estimation and loess smoother can help

  • parallel coordinate plot to detect outliers

categorical outliers

  • It is more difficult to find categorical outliers than continuous outliers

  • Fluctuation diagrams can be used to find categorical outliers.

handling outliers

A strategy for dealing with outliers is as follows

1. Plot the one-dimensional distributions of the variables using boxplots. Examine any extreme outliers to see if they are rare values or errors and decide if they should be removed or imputed.

2. For outliers which are extreme on one dimension, examine their values on other dimensions to decide whether they should be discarded or not. Discard values that are outliers on more than one dimension.

3. Consider cases which are outliers in higher dimensions but not in lower dimensions. Decide whether they are errors or not and consider discarding or imputing the errors.

4. Plot boxplots and parallel coordinate plots by using grouping on a variable to find outliers in subsets of the data.
  • Some statistics are little affected by outliers, e.g., medians, while others are affected greatly, e.g., means and scale estimates

  • Two alternative extremes: either keep it or discard it

  • In modelling, robust methods attempt to reduce the effect of outliers by calculating a weighting for each case

  • Individual extreme values are easy to spot, groups of outliers are more difficult to determine (can be caused by many different reasons)

  • Graphical displays are useful for finding univariate and bivariate outliers

  • Sploms and parallel coordinate plots can be helpful for studying potential outliers

Tidy Data

Only table1 is tidy:

table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

table2
#> # A tibble: 12 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> # … with 6 more rows

table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583

# Spread across two tibbles
table4a  # cases
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

table4b  # population
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583
There are three interrelated rules which make a dataset tidy:    
1. Each variable must have its own column.  
2. Each observation must have its own row.  
3. Each value must have its own cell.  
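A sketch of tidying the table4a layout with tidyr's gather() (newer tidyr code would use pivot_longer()):

```r
library(tidyr)

# table4a layout: the years 1999 and 2000 are column names, not values
table4a <- data.frame(
  country = c("Afghanistan", "Brazil", "China"),
  `1999`  = c(745, 37737, 212258),
  `2000`  = c(2666, 80488, 213766),
  check.names = FALSE
)

# gather() the year columns so each observation gets its own row
gather(table4a, key = "year", value = "cases", `1999`, `2000`)
```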

transforming data

  • rename

  • transposing data frames
    • t(): converts to a matrix; converts numeric columns to character if there are any non-numeric columns; works best if the data frame has row names, which become the column names
    • gather() and spread()
  • transforming multiple columns
    • mutate_all, mutate_if, mutate_at

github

pull from GitHub master to Local master and push from Local master to GitHub master

  1. pull: pull down any changes made to the repo on GitHub by clicking the Down Arrow in the Git pane of RStudio
  2. WORK
  3. Commit/Push: click the Commit button -> enter a commit message -> click the Up Arrow to send the commit to GitHub

Branching


Workflow
1. Create the repo on GitHub (origin/master)
2. Clone it once to a local master (make a local copy; next time this step will be a PULL)
3. Create a branch to do your new work (local branch1)
4. Commit and push the new branch to GitHub to be reviewed (GitHub origin/branch1)
5. Submit a pull request; then someone else merges your changes into master (GitHub origin/master)
- merging a pull request: the original author or someone else can do it; what matters is communicating with your collaborators and deciding how to manage the pull request
6. Your branch is deleted and the new work is pulled into your copy of the master branch
- delete the branch locally: git branch -d <branchname>, then git fetch -p (prune: stop tracking the remote branch)


Local master is the last to know!!

contribute to someone else’s repo

Types of repositories (from your perspective)
- local repository: resides on your computer
- remote repository: resides somewhere else
- origin: the repo that you created or forked on GitHub
- upstream: the original repo of the project that you forked (if you didn’t create it)

  1. Begin by forking another repo on GitHub rather than creating your own
  2. Main challenge: keeping your code up-to-date with upstream

\(1^{st}\) PR

The workflow for the \(1^{st}\) PR
1. fork repo (once)
2. clone repo (once)
3. configure a remote that points to the upstream repository (once)
>git remote add upstream https://...
4. branch
5. work,commit/push
6. submit pull request
7. wait for PR to be merged

\(2^{nd},3^{rd}\),… PR

  1. sync local master with upstream master
    >git fetch upstream
    >git checkout master
    >git merge upstream/master
  2. delete old branch(es)
    1. delete the branch on GitHub
    2. >git branch -d <branchname>
    3. >git fetch -p
  3. push the synced master up to origin/master (GitHub) so that your fork stays up to date